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Conjunctive Bayesian networks (CBNs) are graphical models that describe the accumulation 
of events which are constrained in the order of their occurrence. A CBN is given by a partial 
order on a (finite) set of events. CBNs generalize the oncogenetic tree models of Desper et al. by 
allowing the occurrence of an event to depend on more than one predecessor event. The present 
paper studies the statistical and algebraic properties of CBNs. We determine the maximum 
likelihood parameters and present a combinatorial solution to the model selection problem. Our 
method performs well on two datasets where the events are HIV mutations associated with drug 
resistance. Concluding with a study of the algebraic properties of CBNs, we show that CBNs 
are toric varieties after a coordinate transformation and that their ideals possess a quadratic 
Grobner basis. 
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1. Introduction 

The conjunctive Bayesian network (CBN) model on a finite partially ordered set (poset) 
was introduced in Becrcnwinkel et al. ([7], Section 4) as well as in the form of noisy- AND 
models in the Al literature (e.g., Pearl [20]). Here, we give a self-contained study of the 
statistical and algebraic properties of this model. CBNs are specializations of Bayesian 
networks. They include the oncogenetic (also called mutagenetic) tree models of Dcspcr 
et al. [10] which have proven very useful in cancer research (Radmachcr et al. [22]) and 
in the study of HIV drug resistance (Beerenwinkcl et al. [5] ) . 

The models are motivated by the following class of problems. Consider a finite set 
of genetic events, for example, DNA mutations or chromosomal alterations, and assume 
that the genetic changes are permanent. In this situation, each individual, defined by its 
genotype, is completely characterized by the subset of the events that have occurred. We 
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wish to learn the constraints on the orders in which these events have accumulated. A 
CBN is a probabilistic model of this process derived from a partial order on the set of 
events. This partial order encapsulates the dependencies between events. 

For example, consider the development of drug resistance in HIV. This evolutionary 
process is characterized by the accumulation of resistance mutations in the viral genome. 
Under fixed drug pressure, these mutations are virtually non-reversible because they 
confer a strong selective advantage. Thus, the genetic events are fixations of specific amino 
acid substitutions in the virus population. In each patient, different combinations of 
resistance mutations will occur. We seek to determine the prevalent mutational pathways 
along which HIV accumulates drug resistance (cf. Beerenwinkel et al. [6]). An order 
constraint might read that mutations at position 20 and 82 of the target protein must 
occur before we can see a mutation at position 54. This constraint appears in Figure 
4(a). We will analyze such data in Section 4 and see that CBNs are an efficient, accurate 
tool for this problem. 

As another example, the development of cancer is associated with large-scale genetic 
events such as the gains or losses of parts of chromosomes (Iwasa et al. [15]). Knowledge 
of the constraints on the accumulation of these genetic events helps in assessing the 
progression of the cancer and assigning treatments (cf. Rahnenfuhrcr et al. [23]). 

A CBN consists of a set of binary random variables, called events, and a partial order 
on these events. While we will use the language of the theory of poscts, readers can 
cquivalently think of the partial order as a directed acyclic graph (DAG), with edges en- 
coding the order relations. CBNs are specializations of Bayesian networks, the difference 
being that in a CBN, an event cannot occur until all of its parents have occurred. Thus, 
the events that occur with positive probability form a distributive lattice. Distributive 
lattices are important combinatorial objects which have been studied in statistics. For 
example, the LCI models of (Andersson and Perlman [1] and Andersson et al. [2]) use 
distributive lattices to encode conditional independence statements. Although similar in 
spirit, readers should beware that LCI models are not the same as CBNs. 

Our original motivation for studying CBNs came from work on mutagenetic trees, 
introduced in Desper et al. [10]. Mutagenetic trees assume that each event depends on 
the occurrence of at most one other previous event. CBNs relax this assumption, allowing 
for an arbitrary partial order on the events. By relaxing this assumption, CBNs are able 
to model a larger range of biomedical problems effectively. 

Even though they generalize currently used models, CBNs are still very restrictive com- 
pared to Bayesian networks in general. However, CBNs have the benefit that the max- 
imum likelihood parameters and structure can be written down in closed form (Propo- 
sition 2 and Theorem 5). This is an uncommon phenomenon in the theory of graphical 
models and should be of independent interest. In addition, the number of parameters in 
a CBN does not depend on the graph structure, so we do not need to use, for example, 
the AIC or BIC procedures. 

CBNs have also been studied under the name of noisy- AND models in the Al com- 
munity (Meek and Heckerman [17] and Pearl [20, 21]) as a model for causal inference. 
The basic idea is that a number of causes influence a common effect through latent in- 
termediate variables; the noisy-AND model requires all causes to have happened before 
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the effect can occur. The study of these models focuses on learning the causal structure 
given latent variables, in contrast to our situation where we wish to learn the structure 
of a network of observed variables. 

In this paper, we show that CBNs have desirable algebraic, statistical and combina- 
torial properties. CBN models can be learned efficiently, they can be extended to take 
into account noise in the data and they perform better than mutagenetic trees in our 
applications (cf. Figure 3). This paper is organized as follows. After formally introducing 
CBNs in Section 2, we compute the maximum likelihood (ML) estimator for a CBN in 
Section 3 and use this to give a combinatorial characterization of the CBN model of max- 
imal likelihood. Next, in Section 4, we compare the performance of CBNs to mutagenetic 
trees on two data sets of HIV drug resistance mutations. Finally, in Section 5, we study 
algebraic properties of CBNs. These properties are surprisingly similar to other algebraic 
results for statistical models. This material may ultimately become relevant for statistical 
inference, but may also be of independent interest to mathematicians. We determine the 
prime ideal of algebraic invariants of a CBN and show that this model is a toric variety 
in a suitable coordinate system. Our main tool is the Mobius transform, a standard tool 
in working with posets which has found application in the graphical models literature 
(cf. Drton and Richardson [11], Section 3, and Lauritzen [16], page 239). 

2. Conjunctive Bayesian networks 

A CBN model is specified by a set £ of events, a partial order "<" on the events and 
parameters 9 e for each event e. We will assume that there are n events, labeled as 
[n] := {1, . . . , n}. Therefore, we write the parameters as 9 = (#i, . . . , 9 n ). Frequently, we 
will abuse notation and refer to both the model (£,<,#) and the poset (£,<) as £ 
when the meaning is clear from the context. A relation e\ < e^ between two events in £ 
is interpreted as the requirement that event e\ must happen before event ei can. The 
parameter 9 e is the conditional probability that the event e £ £ has occurred, given that 
its predecessor events have already occurred. 

The state space of the CBN model is the distributive lattice Q — J{£) of order ideals 
in £ . An order ideal is a subset gCf such that if e2 G g and e\ < e2, then e\ £ g. Readers 
unfamiliar with posets and their distributive lattices are referred to Beerenwinkel et al. 
[7], Section 2, for a brief introduction. The elements of Q are called genotypes. Thus, 
a genotype g £ Q is a subset of £ or, equivalcntly, the binary string in which each bit 
indicates the occurrence of an event. This terminology presumes a well-defined ground 
state ... in which none of the events have yet occurred. In our examples, the unmutatcd 
virus strain or the unmutatcd potential cancer cell is referred to as the "wild type." Hence, 
for describing mutant types, we need only keep track of which sites differ from the wild 
type because of the assumption of non-reversibility of events. 

We write min(g c ) for the minimal elements in the complement g c = £ \{g} of a geno- 
type g. The elements of min(g c ) are the events that have not occurred in g but could 
happen next. For example, in Figure 1, if g = {1,2}, then min(g c ) = {3,4}. The proba- 
bility of observing the genotype g G Q in. the CBN model on the poset £ is defined to 
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Figure 1. Poset on four events, order ideals and genotype lattice. 



be 

f fl (0)=n>- n o-- *)- 

eeg e6min(g c ) 

That is, the probability of observing g is the probability that all of the events in g have 
happened times the probability that none of the events that depend only on g have 
happened. 

Equivalcntly, the CBN model on £ is the directed graphical model for the binary 
random variables (X e ) e ^£ whose graph has edges e — ► / for all cover relations e < / in £ 
and whose conditional probability tables are 



[Pr(Jf e - b | X pa(e ) - a)] o6 {o,i}P»(o),b 6 {o,i} - 



where pa(e) denotes the parents of e in the acyclic directed graph £. 



1 _ 
1 

. 1 — @e #e _ 



Example 1 . Let n = 4 and suppose £ is the poset defined on four events by the cover 
relations 1<3, 1<4. 2<3 and 2 < 4. The poset £ has precisely seven order ideals, so 
the distributive lattice Q consists of seven genotypes. They are displayed in Figure 1. The 
CBN model £ is the family of probability distributions on Q which is given parametrically 
as follows: 

P (O) = (1 - - 2 ), Pl{0) = *l(l - 02), 

P 2 (0) = 2 (1 - 0^, P 12 (6) = 0x02(1 - 3 )(1 - 0a), 

Pl23i{0) = 0102030A, Pl23(0) = 010203(1 - 8 4 ), 

P 12i (0) = X 0204(1 - 3 ). 



Conjunctive Bayesian networks 



897 



The sum of these seven polynomials equals one. The parameters are the conditional 
probabilities 9 e = Pr(X e = 1 | X p& ^ = (!,..., 1)). 



3. Maximum likelihood estimation 

Consider any CBN model £ on [n] . The data for this model take the form of a function 
u : Q — > N, g i— > u g , where u g is the number of observations of the genotype g. Given such 
data u £ the following proposition gives an easy formula for maximum likelihood 
estimation of the model parameters. 

Proposition 2. For each event e in the CBN model £ , the ML estimator 9 e of 9 e equals 
the relative frequency of the genotypes which contain e among all genotypes that contain 
the events which are strictly below e. In symbols. 



E 0: 



g:eGg U 9 



g:below(e)Cg % 



for all e £ £. 



Proof. The log-likelihood function for the given data u £ N s equals 

*u(0)=$>ff-(l>S*«+ E Iog(l-0e)J. 

gEG \e£g eSmin(g c ) / 

The partial derivative of this expression with respect to a parameter 9 e is 

d£ u A B 



d9 e 9 e 1 — 9 e 

where A is the sum over all frequencies u g of genotypes g containing e and B is the sum 
over all frequencies u g , where e ^ g but below(e) C g. Equating this partial derivative 
with zero, we obtain 



e A + B' 

which is precisely the formula asserted in the proposition. □ 



Example 3. We illustrate Proposition 2 for the model in Example 1 and Figure 1. Since 
below(l) = 0, the ML estimator for 9\ is 

2 Wi+Ui2+ltl23+lil24+Ml234 

"l = 

U + Ul + U2 + Utf + Ui23 + U 12 4 + U1234 

and similarly for 9 2 - For 9%, below(3) = {1,2} and hence 

2 U 123 + "1234 

C7j = . 

"12 + "123 + M124 + M1234 
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The expression for the ML estimator of 9a is similar. 

Remark 4- Proposition 2 shows that the ML estimator for the CBN model is a rational 
function of the data. In the language of Catanese et al. [8] , this says that the ML degree 
of every CBN model is equal to one. 

We identify the elements of Q with strings in {0,1}™. A probability distribution on 
G is thus an element of the (2™ — 1) -dimensional simplex A with coordinates indexed 
by {0,1}™. Write supp(u) for the non-zero coordinates of u, that is, for the genotypes 
that occur in the data set. We say that u separates the events if for any two elements 
e, / £ [n], there exists g £ supp(it) such that g n {e, /} is either {e} or {/}. If this is not 
the case, then we can consider {e,/} as a single event and replace [n] by [n — 1]. 

We call any genotype g C [n] compatible with the model £ if g £ J{£) = G ■ This is 
equivalent to P g {0) not being the zero polynomial; see also Beerenwinkel and Drton 
[3], Definition 14.2. The data u are said to be compatible with £ if all g £ supp(u) 
are compatible with £. Our next theorem is the main result of this section. It gives 
a combinatorial solution to the problem of model selection among CBNs. Here, any 
given data set u: {0, 1}™ — ► N is identified with the corresponding empirical probability 
distribution in A. For such u £ A, we can compute the ML estimator 9 for each poset £ 
on [n] . We define the ML CBN model for u to be the poset £ for which the log-likelihood 
£ u (9) has the largest numerical value. 

Theorem 5. Let u £ A be a probability distribution which separates the events. There is 
then a unique largest poset £ u such that u is compatible with £ u , and the poset £ u is the 
unique ML CBN model for u. 

Here, "largest poset" refers to the refinement relation among posets on [n], that is, 
£ C £' means that every relation e < / in £ also holds in £'. Note that this inclusion is 
reversed for the induced genotype lattices: £ C £' if and only if G = J{£) D G' = J{£')- 

Proof of Theorem 5. The probability P g (6) is identically zero if and only if g is not 
in G = J(£)- This implies that the likelihood function E[ g esupp(ii) Pgi®)™ 9 1S identically 
zero if and only if u is not compatible with the poset £ . Therefore, we need only consider 
posets £ such that u is compatible with £ . 

We claim that there is a unique maximal poset £ u with which u is compatible. Namely, 
£ u is the set of all relations e\ < e^ such that g n {ei,e.2} ^ { e 2} for all g £ supp(it). 
Note that £ u is then an antisymmetric relation on [n] because u separates the events. 
The relation £ u is transitive because g fl {ei,e3} = {e^} implies g n {ei,e2} = {^2} or 
g n {e2,ea} = {63}. Thus, £ u is a poset and adding any relation makes u incompatible 
with it. 

It remains to show that if £ 1 C £2 Q £m then £ 2 is more likely than £1 . It suffices to 
show this where £ 1 and £2 differ by only one relation, which we assume without loss of 
generality to be 1 < 2. Thus, the events 1 and 2 are incomparable in £\, but 1 must come 
before 2 in £%■ 
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Let Q\ = J{£\) and Q 2 = J (£2)- We begin by finding the ML parameters for the two 
models. Write 9 e for the ML parameters for Q\ and t% for the ML parameters for Q 2 . 
According to Proposition 2, we have 



and ?% 



S ff £g i: beW(e)C s W S Egg£; 2: below(e)Cg U S 

Since (5i D C/2 2 supp(u), the numerators of both expressions are the same, that is, we 
are summing the counts u g over all genotypes g that contain e. 

We claim that 8 e = 7% except when e = 2. In both cases, the denominator is the sum 
over u g for all genotypes g where e has either already occurred or is allowed to occur 
next. Since £\ and £ 2 differ in only one relation, 1 < 2, the denominators are the same 
(and hence 6 e = i%) unless e = 2. 

In order to further analyze the ML estimates, we set 



11 „ 



Vi = u g , V 2 = 



^2 "9- 



N= Y, u g , M= Y, u r 

g£(/i:below(2)Cg g£0 2 :below(2)Cg 

With this notation, the maximum likelihood parameters arc 

° 2= N and * = M- 

Note that since event 1 always happens in the data before event 2, we have V2 < Vy. 
Since £ 2 has more conditions than £\ , we have M < N and since event 1 is required to 
happen before event 2 can in £ 2 , we have V\ < M. Combining these inequalities gives us 
V 2 < Vi < M < N. 

Our analysis will involve the ratios of the ML parameters 



7 2 



M 1-6*2 M N-V 2 



(1) 



1% N ' 1 - % N M -V 2 
For i= 1,2, the likelihood function for the given distribution u equals 

M0;ft)=n(n^ 8 r( n ^-^A. 

9 \eSg / \eemin(g c ) / 

Substitute 9 e = 9 e for i = 1 and 6 e = 1% for i = 2. Our assertion states that 

Si)<^t.(%&). 

To prove this, we consider the ratio L u (6;Qi)/L u (rj;Q 2 ), written as a product over g £ 
supp(w), and we examine the four possibilities for gC\ {1,2}, given as follows. 
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Case 1: g = 00*. Here, event 2 can happen in £\, but it cannot yet happen in £2 since 
it requires event 1 to happen first. This contributes a factor (1 — #2) 119 to the product 
over g in L u (6; Qi)/L u (rj; Q2). Since event 2 has not yet happened, there are no factors 
$2/372 hi this product, so everything else cancels. 

Case 2: g = 01*. This case cannot happen by compatibility. 

Case 3: g = 10*. Event 2 has not happened in either case, so all of the terms in the 
product over e £ g cancel. The same set of events can happen in both Q\ and Q2, so 
everything in the product over e G min(g c ) cancels except the factor with e = 2, which 
occurs both in L u (9;Qi) and in L u (rj;Q2)- 

Case 4: g = 11*. This case is similar to case 3, except that event 2 has now happened 
in both cases. 

The result of this analysis is the identity 

L U (9;G 1 )_ jj n ~ 

00* g— 10* 7 g— 11* 

Note that 



(3) 



H M 9+ ]C U 9 = V 1 aild H M 9 =V 2- 

^—10* g=ll* g=ll* 

Therefore, Y!,g=0Q* u g — 1 — V\. Substituting (1) into (2), we obtain 

£ u (g;gi) _ / JV-V2 y~ v Y M(JV-v- 2 ) \ Vl - v YM\ Va 

Lu{%G2) \ N ) \N{M-V 2 )J \NJ 

_ M Vl (N - V 2 ) 1 - V2 
~ N (M-V 2 ) Vl - V2 ' 

The following lemma shows that (3) is less than or equal to one for all 
0<V 2 <Vi<M <N <1. This completes the proof of Theorem 5. □ 

Lemma 6. If x, y, a, b are real numbers with < a <b < x <y <1, then 

x b (y-a) 1 -" 



y (x — a) b a 



<1. (4) 



Proof. We fix a and b and regard the left-hand side of (4) as a function f a ,b{x,y) of x 
and y. The two partial derivatives of this function satisfy 

df a ,b a{x-b) df a , b o(l-y) 

-dx- = ^a)- fa ^ y) and -dy- = ^a)- fa ' b{X ' y) - 

Both expressions are positive on the triangle {[x, y) € M 2 : max(a, b) < x < y < 1}, hence 
fa,b(x,y) is bounded above by / ,b(l,l) = (1 - a) 1-6 < 1. □ 
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We summarize the results of this section in the following algorithm. 

Algorithm 7 (Model selection and parameter estimation for CBN models). 

Input: A probability distribution ueAon the set of genotypes {0, 1}™. 
Output: The ML CBN model £ u and the ML parameters 9. 

Step 1: Check whether u separates the n events. If not, group non-distinguished events 
together, thus decrementing n, and replace u by the probability distribution which is 
induced on the smaller set of genotypes. 

Step 2: Define the poset £ u on [n] as follows. For any two events e, / £ [n], we set e < / 
in £ u if and only if gC\ {e, /} ^ {/} for all g £ supp(w). 

Step 3: For each event e £ [n], compute 9 e by the formula in Proposition 2. 
Step 4: Output the poset £ u and the vector 9 £ [0, 1]". 

4. Application to HIV genetic data 

The use of Algorithm 7 to obtain the ML CBN model is complicated by the presence 
of noise in real-world data sets. Any relation e < / between two events e and / will 
be estimated to be part of the poset £ only if no genotype which contains / but not 
e has been observed. Thus, the algorithm will miss relations e < / that have strong, 
but imperfect, support. The problem of noisy data has been analyzed in earlier work 
on mutagenetic tree models. It can be addressed by explicit error models in an ML 
framework, as described in Bccrcnwinkel and Drton ([3], Section 14.2) and Bccrcnwinkcl 
and Drton [4], Section 3.3. Also, Szabo and Boucher [27] have incorporated an error 
model directly into the reconstruction algorithm of Desper et al. [10]. 

We propose the following method for constructing a range of CBN models as the error 
tolerance e varies. Let £ e be the poset on [n] which consists of all relations e < / which 
are violated by at most a fraction e of the data. Thus, for e = 0, we recover £ u . Generally, 
some observations g £ supp(w) will be incompatible with the model £ e . These samples 
are removed prior to ML estimation of the model parameters 9. In order to account for 
both the compatible and incompatible data, we use a simple mixture model. 

Write Q t = J(£ e ) for the genotype space of the model £ e . We assume that the incompat- 
ible genotypes g ^Q e are generated with uniform probability l/(2 n — \0 e \). Our mixture 
model £' e is given parametrically by the event probabilities 9 e and a mixture parameter 
A as 



for each observation g £ {0,1}". This expression gives an explicit trade-off between a 
large number of compatible samples and a good model fit. 

Since the mixing distributions of the model £' e have disjoint support, the log-likelihood 
function of the data u: {0, 1}™ — > N decomposes as follows: 




if g eg,, 



i' u {6,\)= 5> g [logA + logP ff (0)] 
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_ (5) 
+ 5»g(l-A)-log(2"-|&|)]. 

Proposition 8. The ML estimators 9 e of 9 e under the model £' e are given by Proposition 
2. The ML estimator X of X under the model £' e is given by the fraction of the data u 
which is compatible with the model £ t . That is, 



X- 



E 



sea, u 9 



9 U 9 



E 



Proof. The partial derivatives of (5) with respect to 9 e are the same as they were in 
Proposition 2. Next, if we solve 

d£ 



dx ^ x ^ i 



_ — A' 

sea, g£Qe 

we obtain the above formula for A. □ 



We now apply these methods to mutation data from HIV that was obtained from pa- 
tients under antirctroviral therapy. The set £ of genetic events consists of seven amino 
acid alterations in the HIV genome that confer drug resistance. Specifically, as an un- 
ordered set, 

£ = {K20R, M36I, M46I, I54V, A71V, V82A, I84V}, 

where, for example, K20R indicates the amino acid mutation from lysine (K) to argininc 
(R) at position 20 of the HIV protease. We consider two datasets from the Stanford 
HIV Drug Resistance Database (Rhee et al. [24]), which consist of 112 and 691 observed 
genotypes under therapy with the protease inhibitors ritonavir (RTV) and indinavir 
(IDV), respectively. 

Previous studies identified correlations and preferred pathways among the resistance 
mutations (Condra et al. [9] and Molla et al. [18]). In particular, in Beerenwinkel et al. 
[7], we used mutagenctic trees to infer the underlying dependency structure. The posets 
are displayed in Figure 2. 

For each dataset, we built posets £ e for various values of e. For each estimated poset, 
we report two numbers: the log-likelihood of the data given the mixture model £' t and 
the mixture parameter A (i.e., the fraction of the data which was explained by the model 
£ e ). We also calculated these numbers for the mutagenetic trees (Figure 2). These results 
are shown in Figure 3. Software for building the posets £ e and computing the likelihood 
is available at http://bio.math.berkeley.edu/CBN/. 

The two posets that maximize £' u for RTV and IDV, respectively, are displayed in 
Figure 4. Note that almost all constructed CBNs £ e performed better than the muta- 
genetic trees. In order to estimate the significance of this difference, we repeated the 
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log-likelihood calculation for each poset using 1000 bootstrap samples from the original 
data. The difference in log-likelihood between these optimal posets and the mutagenetic- 
tree- induced posets is sufficiently large that their distributions derived from the bootstrap 
analysis never overlapped. Thus, the difference between the optimal CBN models and 
the mutagenetic trees is found to be highly significant. 

Comparing the optimal CBNs (Figure 4) to the mutagenetic trees (Figure 2) suggests 
that the mutagenetic trees may induce too many relations and may be handicapped by 
the requirement that the output is a tree. The posets for RTV share two relations (V82A 
< M46I and V82A < I54V), while those for IDV share none. The RTV poset [Figure 4(a)] 
includes the conjunction that both mutations K20R and V82A must occur before 154V, 
which cannot be represented in a mutagenetic tree. By contrast, the IDV poset [Figure 
4(b)] could be represented by a mutagenetic tree, but this tree has not been found by the 
tree-building procedure of Desper et al. [10] . Although the posets and trees do not share 
many relations, they display a similar structure in that the development of ritonavir 
resistance is a much more ordered process than for indinavir (see also Beerenwinkel et 
al. [7]). 

This comparison suggests several advantages of CBNs. First, they provide better model 
fits than the posets derived from the mutagenetic tree models. Second, they rely on 
an ML method both for parameter estimation and for model selection. This stands in 
contrast to the algorithm of Desper et al. [10], which is not an ML procedure. Finally, the 
perturbed CBNs £ e can cover a wide range of fractions of unexplained samples, providing 
a "parametric" picture of the relations present in the data. 



5. Algebraic study of the CBN model 

In this final, section we study CBNs from the perspective of algebraic statistics. Follow- 
ing Pachter and Sturmfels [19], we regard a CBN as an algebraic variety in a space of 



M:jgt 
t 

K20R, 
t 

IMV A71V 
t t 

M46I I54V K20R I54V A71V I84V 

\ / t \ / t 

V82A V82A M46I 



(a) RTV (b) IDV 

Figure 2. Posets corresponding to the mutagenetic trees that were found in Beerenwinkel et 
al. [7], Figure 3, for (a) ritonavir (RTV) and (b) indinavir (IDV). 
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Mutagen else tree 



0.1 0.2 0.3 0.4 0.5 
Fraction of incompatible data 

(a) RTV 



0.6 



0.7 




0.05 0,1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 
Fraction of incompatible data 

(b) IDV 

Figure 3. Log-likelihood £' u for the CBN models E t (filled circles) for various choices of the 
error tolerance e as a function of the fraction of incompatible genotypes g £Q t . The filled squares 
correspond to the trees shown in Figure 2. Quartile bars have been derived from 1000 bootstrap 
samples. Subfigures correspond to (a) ritonavir (RTV) and (b) indinavir (IDV). 



dimension \Q\. The objective is to compute the prime ideals of all polynomials which 
vanish on this variety. These polynomials are the algebraic invariants of the CBN model. 
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M36I I54V 



A71V K20R I84V M46I V82A 
(b) IDV 



Figure 4. Conjunctive Bayesian networks for (a) ritonavir (RTV) and (b) indinavir (IDV) that 
maximize the likelihood £' u . The posets represent the models corresponding to the maxima of 
the graphs shown in Figure 3. 



Example 9. For the model with four events and seven genotypes in Example 1, the 
algebraic invariants are generated by the three polynomials 

Pl23 • P124 - Pl2 ■ P1234, Pi " f>2 ~ Vz ' Pl2 ~ Pes ' Pl23 ~ Pes ' Pl2i ~ Pz ' Pl234 

and 

Pz + Pl +P2+PV2 +P123 +P124 +P1234 - 1- 

Indeed, these three expressions vanish identically if we replace p g by P g {&) for each 
genotype g. Here "are generated" means that every other polynomial with this property 
is a linear combination of the three polynomials. 

The main theorem in this section exhibits an explicit Grobner basis for the algebraic 
invariants of any CBN model. This Grobner basis consists of a set of quadratic polyno- 
mials, together with the trivial invariant YlgegPg — 1, exactly as in Example 9. For the 
special case where £ is a forest, this result was proven in Beerenwinkel and Drton [3], 
Theorem 14.11. Other widely used statistical models have Grobner bases of the same 
form, for example, decomposable Markov random fields (Geiger et al. [12]) and Jukes- 
Cantor models in phylogenetics (Sturmfels and Sullivant [26]). Among Markov random 
fields, having a Grobner basis of quadrics is equivalent to having ML degree one (Geiger 
et al. [12], Theorem 4.4). This suggests a possible relationship between Theorem 5 and 
Theorem 10 below. The analogy to Jukes-Cantor models is noteworthy. These models 
become toric varieties after a linear change of coordinates, known as the Fourier trans- 
form or Hadamard conjugation. The same property will be shown in Corollary 11 for the 
CBN models, but the role of the Fourier transform is now played by Mobius inversion on 
the distributive lattice Q . 

To state our algebraic results, we regard the probabilities p gi for each genotype g in 
Q = J{£), as unknowns. These generate the polynomial ring 



R[g}=R[p g :g&g}. 
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In this ring, we consider the prime ideal Is consisting of all polynomials that vanish on 
the family of probability distributions defined by the CBN model £. Equivalcntly, Is is 
the kernel of the ring map M.[Q] —>M.[£],p g i— > P g (0), where R[£] is the polynomial ring 
generated by the parameters 6 e , e e £. 

We fix a linear extension of the reverse inclusion order on Q, where g — is the 
largest element and g = £ is the smallest clement. We define -< to be the degree reverse 
lexicographic monomial ordering on R.[Q] induced by the variable ordering given by the 
fixed linear extension. 

Theorem 10. The reduced Grobner basis of the ideal Is with respect to the monomial 
ordering -< consists of the trivial linear invariant ^2, g ^gPg — 1, with leading term p®, 
together with one homogeneous quadratic polynomial 

P g -Ph ~ Pguh ■ Pgnh + slower terms (6) 

for each incomparable pair of genotypes {g, h} in the distributive lattice Q. 

Proof. We start our proof of Theorem 10 by recalling that the sum of the polynomials 
P g (9), equals one. If we take the subsum of all polynomials P g {0), where g runs over all 
genotypes containing some fixed genotype h £ Q, then this is essentially the same sum 
with £ replaced by £\h and we conclude that 

£ Pg(9) = l[8 e . 

g:hC.g eEh 

This expression represents the probability that each event in h has happened. The identity 
suggests that we perform the following linear change of coordinates in the polynomial 
ring R[Q]: 

q h := ^ Pg for a11 h e 0- ( 7 ) 

g:hCg 

Thus, in the new coordinates qh , the CBN model is precisely the toric variety associated 
with the distributive lattice Q. A well-known theorem of Hibi [14] states that the ideal 
of this toric variety is generated by the binomials 

ffg • Qh — q g uh ■ q g nh, (8) 

where {g,h} runs over all incomparable pairs of elements of Q. Moreover, these binomi- 
als form a Grobner basis with the underlined terms as the leading terms. Thus, Is is 
generated by the quadrics (8) together with the relation q& — 1, which is obtained from 
(7) under the assumption that the probabilities p g sum to one. Now, if we rewrite (8) in 
terms of the original coordinates p gi then we obtain quadrics of the form (6). 

We claim that the quadrics (8) in the original coordinates p g form a Grobner basis for 
Is- This Grobner basis will be minimal, but not reduced. We shall verify the Grobner 
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basis property by using the theory of sagbi bases (or canonical bases), as described by 
Sturmfels [25], Section 11. 

Let < denote a negative degree monomial ordering on the polynomial ring of param- 
eters, R[£] = M.[8 e : e £ £]. Thus, < is a local monomial ordering in which 1 = 9 is the 
largest monomial and monomials of higher total degree are <-smallcr than monomials 
of lower total degree. See Greuel and Pfister [13] for an introduction to local monomial 
ordcrings. 

We shall prove that the coordinate polynomials P g (0) of the CBN model form a sagbi 
basis for the local ordering <, that is, the <-lcading monomials 

in<(P 9 (0))=;Q0 e (9) 

eeg 

generate the algebra of all <-leading monomials of polynomials in the image of our ring 
map M.[Q] — > R[£], p g i— ► Pg(0). Let J £ be the prime ideal in R[C?] consisting of all algebraic 
relations on the initial monomials (9). By Hibi's result, Js is generated by the binomials 
Pg m Ph —Pguh -PgCih and these binomials form the reduced Grobner basis of Js with respect 
to -<. 

Let w £ R be a weight vector which represents the local ordering < for the coordinate 
polynomials P g (0) and let A T w be the induced weight vector in R e . By Sturmfels [25], 
Lemma 11.2, we have 

m ATw {I £ )CJ £ . (10) 

Importantly, p g ■ p^ — p g uh • P g nh is the initial form of (8) with respect to A T w, so the 
reverse inclusion also holds. Thus, equality holds in (10) and the desired sagbi basis 
property holds by Sturmfels [25], Theorem 11.4. 

By Sturmfels ([25], Corollary 11.6(a)), we conclude that the quadratic model invariants 
(8) form a Grobner basis of Is with respect to H. This Grobner basis is minimal and 
it can be transformed into the reduced Grobner basis by autoreduction. This completes 
the proof of Theorem 10. □ 

A few remarks arc in order. The linear transformation between the p-coordinates and 
the ^-coordinates on the polynomial ring M.[Q], given in (7), is precisely the Mobius inver- 
sion on the genotype lattice Q. Equivalcntly, the coefficients (+1, —1 or 0) of the mono- 
mials in the expanded model coordinates P g {9) are precisely the values of the Mobius 
function on Q. 

Example 9 illustrates Theorem 10 for the model in Example 1. The three model invari- 
ants form a Grobner basis with leading terms pi2z • Pi2i, Pi -P2 and p<z- Mobius inversion 
on the genotype lattice pictured in Figure 1 gives 

Pz = qz- 91-92+912, Pi =<7l -?12, P2 = 92-<?12, 

Pl2 = <7l2 — 9123 — 9124 + 91234, P1234 = 91234, 



P123 — 9123 — 91234, 



P124 



— 9124 — 91234- 
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If we perform these substitutions in the reduced Grobner basis listed in Example 9, then 
the three given model invariants simplify to 

90-1, q\ ■ 92 - to • 9l2, 9123 ' 9124 ~ 91234 ' 9l2- 

Corollary 11. The Mobius inversion (7) on the distributive lattice Q = J{£) is a linear 
change of coordinates which identifies the CBN model £ with the toric variety of the 
distributive lattice Q defined by Hibi. 

We close with the remark that the sagbi basis property of the coordinate polynomials 
of the CBN model, which was established in the course of proving Theorem 10, can be 
used to express any polynomial in the coordinate subalgebra rapidly in terms of the 
generators P g {0). This process, which is known as subduction (Sturmfels [25], Algorithm 
11.1), generalizes the classical procedure of expressing any symmetric polynomial in terms 
of elementary symmetric functions and may be of interest to statisticians. 
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